
    A large vocabulary online handwriting recognition system for Turkish

    Handwriting recognition in general, and online handwriting recognition in particular, has been an active research area for several decades. Most of the research has focused on English and, more recently, on other scripts such as Arabic and Chinese. Research on recognizing Turkish text is lacking, and this work fills that gap by presenting, for the first time, a state-of-the-art recognizer for Turkish. It gives the design and implementation details of a complete system for recognizing isolated Turkish words. Based on hidden Markov models, the system comprises pre-processing, feature extraction, optical modeling and language modeling modules. It first addresses the recognition of unconstrained handwriting with a limited vocabulary and then evolves into a large vocabulary system. Turkish script has many similarities with other Latin scripts, such as English, which makes it possible to adapt strategies that work for them. However, some issues are particular to Turkish and must be considered separately. Two challenging issues are identified: delayed strokes, which introduce an extra source of variation in the sequence order of the handwritten input, and the high out-of-vocabulary (OOV) rate of Turkish when words are used as vocabulary units in the decoding process. This work examines these problems and alternative solutions in depth and proposes solutions suited to Turkish script in particular. For delayed stroke handling, a clear definition of delayed strokes is developed first, and alternative handling methods based on that definition are then evaluated extensively on the UNIPEN and Turkish datasets. The best results are obtained by removing all delayed strokes, with recognition accuracy increases of up to 2.13 and 2.03 percentage points over the respective English and Turkish baselines.
    The overall system accuracy is 86.1% with a 1,000-word lexicon and 83.0% with a 3,500-word lexicon on the UNIPEN dataset, and 91.7% on the Turkish dataset. To address the high OOV rate, alternative decoding vocabularies are designed with grammatical sub-lexical units. Additionally, statistical bi-gram and tri-gram language models are applied during the decoding process. The best performance on the Turkish dataset, 67.9%, is obtained by the large stem-ending vocabulary that is expanded with a bi-gram model. This result is superior to the accuracy of the word-based vocabulary (63.8%) with the same coverage of 95% on the BOUN Web Corpus.
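The abstract mentions statistical bi-gram language models over vocabulary units but gives no details. As a rough illustration only, a bigram model over unit sequences could be sketched as below. The class name `BigramModel`, the toy unit sequences, and the add-one smoothing are all assumptions made for this sketch; the abstract does not state which smoothing or toolkit the work used.

```python
import math
from collections import Counter
from typing import Sequence

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


class BigramModel:
    """Add-one smoothed bigram model over vocabulary units (illustrative only)."""

    def __init__(self, sequences: Sequence[Sequence[str]]) -> None:
        self.context_counts: Counter = Counter()  # how often each unit is a context
        self.bigram_counts: Counter = Counter()   # how often each (prev, cur) pair occurs
        self.vocab = {END}                        # units that may follow a context
        for seq in sequences:
            padded = [START, *seq, END]
            self.vocab.update(seq)
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))

    def prob(self, prev: str, cur: str) -> float:
        # Add-one smoothing: unseen pairs keep a small non-zero probability.
        numerator = self.bigram_counts[(prev, cur)] + 1
        denominator = self.context_counts[prev] + len(self.vocab)
        return numerator / denominator

    def log_score(self, seq: Sequence[str]) -> float:
        # Log-probability of the whole unit sequence, including boundaries.
        padded = [START, *seq, END]
        return sum(math.log(self.prob(p, c)) for p, c in zip(padded, padded[1:]))
```

With such a model, a decoder could prefer unit sequences seen in the training corpus (for example a stem followed by a plausible ending) over sequences in an unseen order, by comparing their `log_score` values.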

    Large vocabulary recognition for online Turkish handwriting with sublexical units

    We present a system for large vocabulary recognition of online Turkish handwriting, using hidden Markov models. While using a traditional approach for the recognizer, we have identified and developed solutions for the main problems specific to Turkish handwriting recognition. First, since large amounts of Turkish handwriting samples are not available, the system is trained and optimized using the large UNIPEN dataset of English handwriting, before being extended to Turkish using a small Turkish dataset. Second, the delayed strokes, which are a significant source of variation in writing order due to the large number of diacritical marks in Turkish, are removed during preprocessing. Finally, as a solution to the high out-of-vocabulary rates encountered when using a fixed-size lexicon in general-purpose recognition, a lexicon is constructed from sublexical units (stems and endings) learned from a large Turkish corpus. A statistical bigram language model learned from the same corpus is also applied during the decoding process. The system obtains a 91.7% word recognition rate when tested on a small Turkish handwritten word dataset using a medium-sized (1,950 words) lexicon corresponding to the vocabulary of the test set, and 63.8% using a large, general-purpose lexicon (130,000 words). However, with the proposed stem+ending lexicon (12,500 words) and a bigram language model with lattice expansion, a 67.9% word recognition accuracy is obtained, surpassing the results obtained with the general-purpose lexicon while using a much smaller one.
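To make the out-of-vocabulary argument concrete, the sketch below compares the OOV rate of a word lexicon with that of a stem+ending lexicon on a handful of tokens. The function names, the toy stems and endings, and the simple "split anywhere" matching rule are assumptions made for illustration; the abstract does not describe how the units were actually learned or matched.

```python
from typing import Callable, Iterable, Set


def covered_by_words(word: str, lexicon: Set[str]) -> bool:
    """A word is covered only if it appears in the lexicon as a whole."""
    return word in lexicon


def covered_by_stem_ending(word: str, stems: Set[str], endings: Set[str]) -> bool:
    """A word is covered if it splits into a known stem plus a known ending.

    An empty ending is allowed, so bare stems count as covered.
    """
    for i in range(1, len(word) + 1):
        stem, ending = word[:i], word[i:]
        if stem in stems and (ending == "" or ending in endings):
            return True
    return False


def oov_rate(tokens: Iterable[str], is_covered: Callable[[str], bool]) -> float:
    """Fraction of tokens that the vocabulary cannot represent."""
    tokens = list(tokens)
    misses = sum(1 for t in tokens if not is_covered(t))
    return misses / len(tokens)


# Toy data (hypothetical, not taken from the BOUN Web Corpus).
tokens = ["ev", "evler", "evlerim", "kitaplar", "kitabım"]
word_lexicon = {"ev", "evler", "kitap"}
stems = {"ev", "kitap"}
endings = {"ler", "lar", "im", "lerim"}

word_oov = oov_rate(tokens, lambda w: covered_by_words(w, word_lexicon))
unit_oov = oov_rate(tokens, lambda w: covered_by_stem_ending(w, stems, endings))
```

In this toy setting, four stem and ending units cover inflected forms that the word lexicon misses, which is the intuition behind a smaller sublexical lexicon reaching comparable coverage. The form "kitabım" stays uncovered because its stem changes ("kitab"), a case this simplistic matcher does not handle.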

    A comparative study of delayed stroke handling approaches in online handwriting

    Delayed strokes, such as i-dots and t-crosses, pose a challenge in online handwriting recognition by introducing an extra source of variation in the sequence order of the handwritten input. The problem is especially relevant for languages where delayed strokes are abundant and training data are limited. Studies on handling delayed strokes have mainly focused on Arabic and Farsi scripts, where the problem is most severe, with less attention devoted to scripts based on the Latin alphabet. This study investigates the effectiveness of the delayed stroke handling methods proposed in the literature. The evaluated methods include removing delayed strokes and embedding delayed strokes in the correct writing order, together with their variations. Starting with new definitions of a delayed stroke, we tested each method using hidden Markov model classifiers, separately for English and Turkish, and bidirectional long short-term memory networks for English. For both the UNIPEN and Turkish datasets, the best results with hidden Markov model recognizers are obtained by removing all delayed strokes, with accuracy increases of up to 2.13 and 2.03 percentage points over the respective baselines. In the case of the bidirectional long short-term memory networks, stroke order correction of the delayed strokes by embedding performs best, with improvements of 1.81 (raw) and 1.72 (post-processed) percentage points over the baseline.
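The removal method described above can be illustrated with a minimal sketch. The study's own definitions of a delayed stroke are not given in this abstract, so the rule used here is a hypothetical stand-in: a stroke counts as delayed if it lies entirely to the left of the rightmost point already written. The names `is_delayed` and `remove_delayed_strokes` are likewise invented for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) pen coordinate
Stroke = List[Point]         # points between one pen-down and pen-up


def is_delayed(stroke: Stroke, previous: List[Stroke]) -> bool:
    """Hypothetical rule: the stroke ends left of everything written so far."""
    if not stroke or not previous:
        return False
    rightmost_so_far = max(x for s in previous for x, _ in s)
    stroke_right = max(x for x, _ in stroke)
    return stroke_right < rightmost_so_far


def remove_delayed_strokes(strokes: List[Stroke]) -> List[Stroke]:
    """Keep strokes in writing order, dropping those flagged as delayed."""
    kept: List[Stroke] = []
    for stroke in strokes:
        if not is_delayed(stroke, kept):
            kept.append(stroke)
    return kept


# Toy word "it": i-body, t-body, then the i-dot and t-cross written afterwards.
word = [
    [(0.0, 0.0), (1.0, 1.0)],  # body of "i"
    [(2.0, 0.0), (3.0, 2.0)],  # body of "t"
    [(0.5, 1.5)],              # i-dot, written late
    [(1.8, 1.0), (2.9, 1.0)],  # t-cross, written late
]
cleaned = remove_delayed_strokes(word)
```

Here `cleaned` keeps only the two letter bodies. A rule this simple would miss a t-cross that extends past the rightmost point of the word, which is one reason a careful definition of delayed strokes matters.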
